The method of the present invention combines concept searching, document ranking, high speed and efficiency, browsing capabilities, "intelligent" hypertext, document routing, and summarization (machine abstracting) in an easy-to-use implementation. The method also offers Boolean and statistical query options. It is based upon "concept indexing" (an index of "word senses" rather than just words). It builds its concept index from a "semantic network" of word relationships with word definitions drawn from one or more standard human-language dictionaries. During query construction, users may select the meaning of a word from the dictionary, or may allow the method to disambiguate words based on semantic and statistical evidence of meaning. This results in a measurable improvement in precision and recall. Results of searching are retrieved and displayed in ranked order. The ranking process is more sophisticated than that of prior art systems providing ranking because it takes linguistics and concepts, as well as statistics, into account.




Figure 1

Field of the Invention



The present invention is a method for computer-based information retrieval. Specifically, the method of the present invention offers four advances in the art of computer-based text retrieval. First, querying is simple: queries may be expressed in plain English (or in another suitable human language). Second, searching for [...] prior art. Third, the method of the present invention is more efficient than sophisticated text retrieval methods [...] systems, and approximately 50% for statistical methods of the prior art. Finally, the method of the present invention manages the entire research process for a user.

Background of the Invention 

Although there are dozens of information retrieval software systems commercially available, most of them are based on older Boolean search technology. A few are based on statistical search techniques [...]

But, to [...] access to relevant information and to put that information in the hands of [...] users [...] needs a minimum investment of time by the user. The following distinctive features and benefits delineate these significant aspects of the method of the present invention.
To date, there have been three major classes of text retrieval systems: 

• Keyword or Boolean systems that are based on exact word matching 

• Statistical systems that search for documents similar to a collection of words 

• Concept based systems that use knowledge to enhance statistical systems 

Keyword or Boolean systems dominate the market. These systems are difficult to use and perform poorly (typically 20% recall for isolated queries). They have succeeded [...]

[...] performance to near 50% recall. [...] trained search expertise is still needed to formulate queries in several ways to conduct an adequate search.

A concept based system further closes the performance gap by adding knowledge to the system. To date, there is no standard way to add this knowledge. There are very few concept based search systems [...] intensive manual [...] systems.

[...] for improvement in text retrieval is its use of Natural Language Processing (NLP). [...] in government development programs, most of them [...] areas [...] and they are incomplete and unsuitable for commercialization. The failure of many early research prototypes of NLP based text retrieval systems has led to much skepticism in the industry, leading many to favor statistical approaches.

[...] in the research community in the combination of NLP and conventional [...] the growing number of workshops on the subject. The American Association for Artificial Intelligence sponsored two of them. The first was held at the 1990 Spring AI Symposium at Stanford University on the subject of "Text Based Intelligent Systems". The second one (chaired by the applicant herein) was held at AAAI-91 in Anaheim in July 1991.

Natural Language Techniques 

The literature is rich in theoretical discussions of systems intended to provide functions similar to those outlined above. A common approach in many textbooks on natural language processing (e.g. [...], Benjamin Cummings, 1987) is to use [...] to identify the meanings of words in text. Such systems are "hand-crafted", meaning that new rules must be written for each new use. These rules cannot be found in any published dictionary or [...]. This approach is rarely employed in text retrieval as it usually fails in some critical way to provide adequate results.

Krovetz has reported in various workshops (AAAI-90 Spring AI Symposium at Stanford University) and in Lexical Acquisition by Uri Zernick, Lawrence Erlbaum, 1991, ISBN 0-8056-0829-9, that "disambiguating word senses from a dictionary" would improve the performance of text retrieval systems, [...] that this method will improve precision. The author's philosophy suggests [...] identified by "confirmation in context from multiple sources of evidence". None of Krovetz's publications [...] technique for doing so, and his recent publications indicate that he [...]

Eugene Charniak of Brown University has reported in "AI Magazine" (AAAI, Winter 1992), and has spoken at the Naval Research Laboratory AI Laboratory (November 1991), about the technique of employing "spreading activation" to identify the meaning of a word in a small text. Charniak employs a "semantic network" and begins with all instances of a given word. It then "fans out" in the network to find neighboring terms that are located near the candidate term in the text. This technique suffers from two admitted drawbacks: it requires a high-quality, partially hand-crafted, small semantic network, and this semantic network is not derived from published sources. Consequently, the Charniak method has never been applied to any text longer than a few sentences in a highly restricted domain of language.

[...] Haas of the University of North Carolina has attempted to use multiple dictionaries in information retrieval, including a main English dictionary coupled with a vertical application dictionary (such as a dictionary of computer terms used in a computer database). Haas' approach does not take advantage of word sense disambiguation, and she reported at ASIS, October 1991, that merging two dictionaries gave no measurable increase in precision and recall over a single generic English dictionary.

[...] Lexical Acquisition, Lawrence Erlbaum, 1991, suggests in the same book that a "cluster signature" method from pattern recognition be used to identify word senses in text. The method lists words commonly co-occurring with a word in question and determines the percentage of the time that each of the [...] context [...] for each word meaning. This is called [...] word sense. The signatures of each meaning are compared with the use of a word [...]. The pattern recognition approach is based upon a cluster technique discussed [...] Pattern Classification [...], John Wiley [...], New York [...]



[...] (cited above), discusses use of a "subject hierarchy" to compute [...] disambiguate word senses. Generally, a "subject" or topic is identified by the context. A [...] its relevance to the topic. This approach is only as strong as the depth of the subject hierarchy and it does not handle exceptions. A drawback of this approach [...]

One well known example of prior art in text retrieval that uses natural language input is the statistical techniques developed by Gerard Salton of Cornell University. His research system, called SMART, is now used in commercial applications; for example, Individual Inc. of Cambridge, MA uses it in a news clipping service. Salton is well known for his claims that natural language processing based text retrieval systems do not work [...] claims on limited experiments that he ran in the 1960s. [...] 1991 [...] meeting he stated that the reason natural language processing based systems don't work is that syntax is required and syntax is not useful without semantics. He further claims that "semantics is not available" due to the need to handcraft the rules. However, the system of the present invention has made semantics available [...] processing on machine readable dictionaries and automatic acquisition of [...]

Lexical Acquisition

In the field of lexical acquisition, most of the prior art is succinctly summarized in the First Lexical Acquisition Workshop Proceedings, August 1989, Detroit, at IJCAI-89. There is a predominance of papers covering the automatic building of natural language processing lexicons for rule-based processing. Over [...] papers [...] for acquiring information from electronic [...]

Indexing

Typical text search systems contain an index of words with references to the database. For large document databases, the number of references for any single term varies widely. Many terms may have only one reference while other terms may have from 100,000 to 1 million references. The prior art substitutes thesaurus entries for search terms, or simply requires the user to rephrase his queries in order to "tease" information out of the database. The prior art has many limitations. In the prior art, processing is at the level of words, not concepts. Therefore, the query explosion produces too many irrelevant variations to be useful in most circumstances. In most prior art systems, the user is required to restate queries to maximize recall. This limits such systems to use by "expert" users. In prior art systems, many relationships not found in a classical thesaurus cannot be exploited (for example, a "keyboard" is related to a "computer" but it is not a synonym).

Contextual Systems

The prior art of systems which attempt to extract contextual understanding from natural language statements is primarily that of Gerard Salton (described in Automatic Text Processing, Addison-Wesley Publishing Company, 1989). As described therein, such systems simply count terms (words) and co-occurrences of terms but do not "understand" word meanings.

Routing means managing the flow of text or message streams and selecting only text that meets the desired profile of a given user to send to that user. Routing is useful for electronic mail, news wire text, and intelligent message handling. It is usually the case that a text retrieval system designed for retrieval from archived text is not good for routing, and vice versa. For news wire distribution applications (which seek to automate distribution of the elements of a "live" news feed to members of a subscriber audience based on "interest profiles"), it is time-intensive and very difficult to write the compound Boolean profiles upon which such systems depend. Furthermore, these systems engage in unnecessary and repetitive processing as each interest profile and article are processed.

Document Ranking 

Systems which seek to rank retrieved documents according to some criterion or group of criteria are discussed by Salton, in Automatic Text Processing (ranking on probabilistic terms), and by Donna Harmon in a recent ASIS Journal article (ranking on a combination of frequency related methods). Several commercial systems use ranking but their proprietors have never disclosed the algorithms used. Fulcrum uses (among other factors) document position and frequency. Personal Library Software uses inverse document frequency, term frequency and collocation statistics. Verity uses "accrued evidence based on the presence of terms defined in search topics".

Concept Definition and Search 

The prior art comprises two distinct methods for searching for "concepts". The first and most common of these is to use a private thesaurus where a user simply defines terms in a set that are believed to be related. Searching for any one of these terms will physically also search for and find the others. The literature is replete with research papers on uses of a thesaurus. Verity, in its Topic software, uses a second approach. In this approach users create a "topic" by linking terms together and declaring a numerical strength for each link, similar to the construction of a "neural network". Searching in this system retrieves any document that contains sufficient (as defined by the system) "evidence" (the presence of terms that are linked to the topic under search). Neither of these approaches is based upon the meanings of the words as defined by a publisher's dictionary.

Other prior art consists of two research programs: 

• TIPSTER: A government research program called TIPSTER is exploring new text retrieval methods. This work will not be completed until 1996 and there are no definitive results to date.

• CLARIT: Carnegie Mellon University (CMU) has an incomplete prototype called CLARIT that uses dictionaries for syntactic parsing information. The main claim of CLARIT is that it indexes phrases that it finds by syntactic parsing. Because CLARIT has no significant semantic processing, it can only be viewed as a search extension of keywords into phrases. Its processing is subsumed by the present invention, with the conceptual processing and semantic networks.

Hypertext 

Prior art electronically-retrieved documents use "hypertext", a form of manually pre-established cross-reference. The cross-reference links are normally established by the document author or editor, and are static for a given document. When the linked terms are highlighted or selected by a user, the cross-reference links are used to find and display related text.

Machine Abstracting 

Electronic Data Systems (EDS) reported machine abstracting using keyword search to extract the key sentences based on commonly occurring terms which are infrequent in the database. This was presented at an American Society for Information Science (ASIS) 1991 workshop on natural language processing. They further use natural language parsing to eliminate subordinate clauses.

[...] concepts, not just keywords. In addition, the present invention uses semantic networks to further abstract these concepts to gain some general idea of the intent of the document.



Summary 



[...] the shortcomings of prior art systems for textual document search and retrieval. Most commercial systems of the prior art rely on "brute force indexing" and [...] search, which provides fast response only for lists of documents which are ranked according to [...] the entire search to complete before any information is produced. Alternatively, these systems display documents quickly, but without any guarantee that the documents displayed are the most [...]

The systems of the prior art rank documents retrieved on the presence of words, not word meanings. The prior art systems fail to use linguistic evidence such as syntax or semantic [...] dictionaries, and thus, to the extent such [...]

In thesaurus-based information retrieval systems, as well as topic based information retrieval systems, [...] by linking words, not word meanings. In these systems [...] assignments to topic definitions. Prior art thesaurus and topic based [...] to an entire network of concepts in the natural language [...]

Instead, isolated term groups are created that do not connect to the remainder of any concept [...]

[...] authors need not spend time coding hypertext links [...]

Brief Description of the Invention 

The method of the present invention combines concept searching, document ranking, high speed and efficiency, browsing capabilities, "intelligent" hypertext, document routing, and summarization (machine abstracting) in an easy-to-use implementation.

The method offers three query options:
Natural Language: finding documents with concepts expressed in plain English;
Query by Example: present a document, retrieve similar documents;
Private Concept: define a new term, enter it in the "semantic network", search.

The method of the present invention continues to provide Boolean and statistical query options [...]

The method of the present invention is based upon "concept indexing" (an index of "word senses" rather than just words). A word sense is a specific meaning of a word or idiom. The method of the present invention builds its concept index from a "semantic network" of word relationships with word definitions drawn from one or more standard human-language dictionaries. During query construction, users may select the meaning of a word from the dictionary [...]. This results in a measurable improvement in precision.

Results of searching are retrieved and displayed in ranked order. The ranking process is more sophisticated [...]

The method of the present invention uses an artificial intelligence "hill climbing" search to retrieve and [...] documents while the remainder of the search is still being processed. The method of the present invention achieves major speed advantages for interactive users.

[...] documents directly and moving around within and between documents by related concepts [...] dynamically compiled "hypertext" [...]

Brief Description of the Drawings

Figure 1 depicts the computer program modules which implement the method of the present invention.

Figures 2a-d depict a detailed flow diagram of the concept indexing process according to the present invention.

Figure 3 depicts the process whereby the method of the present invention disambiguates word senses based on "concept collocation".

Figure 4 depicts the sources of information in an automatically-acquired machine-readable dictionary according to the present invention.

Figure 5 illustrates the structure of the machine-readable dictionary of the present invention.

Figure 6 depicts a flow diagram of the query process according to the present invention.

Detailed Description of the Invention 

The method of the present invention is a "Natural Language Processing" based text retrieval method. [...] systems, and those that exist require intensive manual [...]. The method of the present invention uses published dictionaries [...] language input, making the interface considerably simpler.

In the method of the present invention: 

• There are no hand-crafted rules for each word meaning 

• Idioms and repetitive phrases are processed as a single meaning 

• Unknown words, proper names and abbreviations are automatically processed 

• Ill-formed input with poor grammar and spelling errors can be processed

The method of the present invention has combined the document ranking procedure with the search process. [...] available systems first retrieve all possible documents and then rank [...] support the advanced demands of natural language text retrieval.
In the method of the present invention: 

• Only the best documents are retrieved 

• Searching is guided by document ranking 

• The document database is automatically divided into multiple sets 

• Searching over document sets significantly improves method performance 
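
The points above may be illustrated by a minimal Python sketch. It is not the search procedure of the invention, which this passage does not spell out; it is a simple best-first search with score bounds, used here as a stand-in. It assumes that each document set carries an upper bound on the score any of its members can reach, so that the best documents can be released before every set has been examined. The names `DocumentSet` and `ranked_search`, and all scores, are hypothetical.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass
class DocumentSet:
    """Hypothetical container: a group of documents plus a score ceiling."""
    best_possible: float       # assumed upper bound on any member's score
    scores: Dict[str, float]   # document id -> rank score for the query


def ranked_search(sets: List[DocumentSet], limit: int) -> Iterator[Tuple[str, float]]:
    """Yield up to `limit` documents in descending score order.

    Sets are opened best-first. A document is released as soon as no
    unopened set could contain a better one, so early results appear
    while the rest of the search is still pending.
    """
    pending = sorted(sets, key=lambda s: s.best_possible, reverse=True)
    found: List[Tuple[float, str]] = []   # max-heap via negated scores
    released = 0
    while released < limit and (pending or found):
        ceiling = pending[0].best_possible if pending else float("-inf")
        if found and -found[0][0] >= ceiling:
            neg_score, doc_id = heapq.heappop(found)
            released += 1
            yield doc_id, -neg_score
        else:
            # Open the most promising unopened set and score its members.
            for doc_id, score in pending.pop(0).scores.items():
                heapq.heappush(found, (-score, doc_id))


sets = [
    DocumentSet(0.9, {"d1": 0.9, "d2": 0.4}),
    DocumentSet(0.6, {"d3": 0.6, "d4": 0.5}),
    DocumentSet(0.2, {"d5": 0.1}),
]
top_three = list(ranked_search(sets, 3))
```

In this sketch the third set is never opened when only three documents are requested, which is the sense in which ranking guides the search.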

Architecture 

The method of the present invention has been implemented as 5 computer program modules: the Query Program, the Index Program, the Library Manager, Dictionary Manager, and the Integrator's Toolkit. Each of these is defined below and their relationships are shown in Figure 1.

• Query Program: Program to accept queries and execute searches

• Index Program: Program to index new or updated documents

• Library Manager: Program to manage the organization of text files

• Dictionary Editor: Program to maintain dictionary/private searches

• Integrator's Toolkit: Program for developers to integrate the present invention with other computer systems and program products

The method of the present invention offers Graphical User Interfaces, command line interfaces, and tools to customize the user interface. The display shows the title hits in ranked order and the full text of the documents. Documents can be viewed, browsed and printed from the interface. The Integrator's Toolkit allows the product to be installed in any interface format. The system is an open system. It makes heavy use of "Application Program Interfaces" (APIs), or interfaces that allow it to be integrated, linked or compiled with other [...]

Natural Language Processing 

The method of the present invention is the first text search system that uses published dictionaries to build automatically the underlying knowledge base, eliminating the up front cost that an organization must absorb to use other concept based search systems. In addition, the dictionary gives knowledge needed to process accurately natural language input, making the user interface considerably simpler. The algorithms used identify the meaning of each word based upon a process called "spreading activation". NLP as used in the present invention improves text retrieval in many ways, including the following:

• Morphological analysis allows better matching of terms like "computing" and "computational". Traditional suffix stripping hides these related meanings and may introduce errors when suffixes are improperly removed.

• Syntactic analysis gives insight into the relationship between words.

• Semantics resolve ambiguity of meaning (e.g., chemical plant vs. house plant).

• Natural Language may be used to interact with the user, including allowing the user to select meanings of words using dictionary definitions.

Statistical Word Sense Disambiguation Using a Publisher's Dictionary 

The purpose of this is to identify the specific meaning of each word in the text as identified in a publisher's dictionary. The reason to do this is to increase the precision of the return during document retrieval [...]. This is primarily a semantic "word sense disambiguation" and takes place via a "spreading activation" concept through a "semantic network". The method used disambiguates word senses (identifies word meanings) based on "concept collocation". If a new word sense appears in the text, the likelihood is that it is similar in meaning or domain to recent words in the text. Hence, recent syntactically compatible terms are compared through the semantic network (discussed below) by "semantic distance". A classic example is that the word "bank" when used in close proximity to "river" has a different meaning from the same word when used in close proximity to "check".

To make this concept work correctly, an underlying semantic network defined over the word senses is needed. An example of such a network is illustrated in the discussion which follows. Note that only one link type is used. This is an "association link" which will be assigned a link strength from 0 to 1. Past industrial experience with commercial systems has shown difficulty in maintaining rich semantic networks with many link types. Further, this concept indexing scheme does not require a deep understanding of the relationship between word senses. It simply must account for the fact that there is a relationship of some level of belief.

The present invention uses a new form of statistical natural language processing that uses only information directly acquirable from a published dictionary and statistical context tests. Words are observed in a local region about the word in question and compared against terms in a "semantic network" that is derived directly from published dictionaries (see discussion below on automatic acquisition). The resulting statistical test determines the meaning, or reports that it cannot determine the meaning based upon the available context. (In this latter case, the method simply indexes over the word itself as in conventional text retrieval, defaulting to keyword or thesaurus processing.)
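
The decision described above, either settling on one meaning or reporting that the context is insufficient, may be sketched in Python as follows. This is only an illustration under stated assumptions: the sense labels, the link strengths and the `margin` rule that stands in for the statistical test are all invented here and are not taken from the specification.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical association links between word senses and context words,
# each with a strength from 0 to 1. The values are invented for illustration.
LINKS: Dict[Tuple[str, str], float] = {
    ("bank/river-edge", "river"): 0.8,
    ("bank/river-edge", "water"): 0.5,
    ("bank/financial", "check"): 0.8,
    ("bank/financial", "money"): 0.7,
}

SENSES: Dict[str, List[str]] = {"bank": ["bank/river-edge", "bank/financial"]}


def choose_sense(word: str, context: List[str], margin: float = 0.2) -> Optional[str]:
    """Return the best-supported sense of `word`, or None if undecided.

    Support for a sense is the sum of its link strengths to the nearby
    context words. None signals that the caller should index the plain
    word instead, as in conventional keyword retrieval.
    """
    support = {
        sense: sum(LINKS.get((sense, c), 0.0) for c in context)
        for sense in SENSES.get(word, [])
    }
    ranked = sorted(support.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] == 0.0:
        return None                      # no evidence at all
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return None                      # evidence too close to call
    return ranked[0][0]
```

With these invented links, "bank" near "river" resolves to the river-edge sense, "bank" near "check" resolves to the financial sense, and a context containing both, or neither, yields no decision, so the word would be indexed as a plain keyword.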

This method overcomes all the limitations discussed above. Hand-crafted rules are not required. The method applies to any text in any subject (obviously, in vertical subject domains, the percentage of words that can be disambiguated increases with a dictionary focused on that subject). No training is required and exceptions outside of a subject domain can easily be identified. The significance of this method is that now any text may be indexed to the meanings of words defined in any published dictionary, generic or specialized. This allows much more accurate retrieval of information. Many fewer false hits will occur during text retrieval.

Concept Indexing 

Figures 2a-d show a detailed breakout of the concept indexing process. The process extracts sentences from the text, tags the words within those sentences, looks up words and analyzes morphology, executes a robust syntactic parse, disambiguates word senses and produces the index.

The first step in the indexing process is to extract sentences or other appropriate lexical units from the text. A tokenizer module that matches character strings is used for this task. While most sentences end in periods or other terminal punctuation, sentence extraction is considerably more difficult than looking for the next period. Often, sentences are run on, contain periods with abbreviations creating ambiguities, and sometimes have punctuation within quotes or parentheses. In addition, there exist non-sentence strings in text such as lists, figure titles, footnotes, section titles and exhibit labels. Just as not all periods indicate sentence boundaries, so too not all paragraphs are separated by a blank line. The tokenizer algorithm attempts to identify these lexical boundaries by accumulating evidence from a variety of sources, including a) Blank lines, b) Periods, c) Multiple spaces, d) List bullets, e) Uppercase letters, f) Section numbers, g) Abbreviations, h) Other punctuation.

For example:

• English only Queries

• Fast Integrated Ranking and Retrieval
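
The accumulation of boundary evidence may be sketched as follows in Python. The text names the evidence sources but not how they are combined, so the point values, the threshold, the abbreviation list and the function names below are assumptions made purely for illustration.

```python
import re
from typing import List

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "e.g.", "i.e.", "etc."}  # tiny sample


def boundary_evidence(prev_token: str, next_token: str, gap: str) -> int:
    """Accumulate evidence (in points) that a lexical unit ends between two tokens.

    `gap` is the whitespace separating the tokens. The point values are
    invented for this sketch.
    """
    score = 0
    if gap.count("\n") >= 2:
        score += 10                       # blank line
    if len(gap) >= 2 and "\n" not in gap:
        score += 3                        # multiple spaces
    if prev_token.endswith((".", "!", "?")):
        score += 6                        # terminal punctuation
    if prev_token.lower() in ABBREVIATIONS:
        score -= 6                        # the period belongs to an abbreviation
    if next_token[:1].isupper():
        score += 3                        # uppercase letter follows
    if next_token.startswith(("•", "-", "*")) or re.match(r"\d+(\.\d+)*\.?$", next_token):
        score += 5                        # list bullet or section number
    return score


def split_units(text: str, threshold: int = 8) -> List[str]:
    """Split text wherever the accumulated evidence reaches `threshold`."""
    parts = re.split(r"(\s+)", text.strip())   # tokens at even indexes, gaps at odd
    units, current = [], [parts[0]]
    for i in range(1, len(parts) - 1, 2):
        gap, token = parts[i], parts[i + 1]
        if boundary_evidence(current[-1], token, gap) >= threshold:
            units.append(" ".join(current))
            current = [token]
        else:
            current.append(token)
    units.append(" ".join(current))
    return units


sample = "Dr. Smith arrived. He sat down.\n\n• English only Queries"
units = split_units(sample)
```

In this sketch the period after "Dr." does not end a unit because the abbreviation evidence cancels the punctuation evidence, while the blank line and bullet together separate the list item from the preceding sentence.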



139-23-9024 (Social security number)
29 Jan 92 (Date)

1. noun, "round spherical object"
   Word Sense A9C2 (pointer into semantic network)
2. verb, "to gather into a ball, wad"
   Word Sense A9C3
3. noun, "dance or party, typically formal"
   Word Sense A9C4
4. Third word of idiom #EB23, "Have a ball"

If word suffix is: ies, ing
Find dictionary word with this suffix: y
Dictionary word part of speech: noun: singular; verb: infinitive
New word part of speech: noun: plural; verb: 3rd person sing.

4) Proper noun identification: A mechanism for identifying proper nouns in text is provided because it is [...]. If there is supporting evidence that a word is a proper noun, and it is not in the dictionary, then it is assumed that the word is indeed a proper noun. If it is in the dictionary, the method will look for further supporting evidence by performing syntactic analysis on the word, at which time it may be declared to be a proper noun and is indexed as such.

5) Idiom Processing: When a word is retrieved from the dictionary, all of the information about how that word may be used in idioms is also retrieved. This information includes the idiom in which the word is used and the number of the word within the idiom. This information is used to collect the words of an idiom into a single concept, which is then passed to the natural language algorithms. As an example, consider,

"have a ball"
"have" -> idiom #EB23, Word #1
"a" -> idiom #EB23, Word #2
"ball" -> idiom #EB23, Word #3, last word of the idiom

The idiom processor looks for a sequence of words, like those above, which start with word #1, continue in sequence to the last word, and all have the same idiom number. Sequences of words which pass these rules are collected into a single concept. For our purposes, we use the term "idiom" in a rather loose sense, meaning any string of more than one word that frequently occurs together and implies a single meaning or concept.
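
The sequence rule for idioms may be sketched in Python as shown below. The table layout, the `idiom#` token format and the function names are assumptions for illustration; only the idiom number EB23 and the word positions come from the example above.

```python
from typing import Dict, List, Tuple

# Hypothetical idiom table: word -> list of (idiom id, position, is last word).
IDIOM_INFO: Dict[str, List[Tuple[str, int, bool]]] = {
    "have": [("EB23", 1, False)],
    "a": [("EB23", 2, False)],
    "ball": [("EB23", 3, True)],
}


def match_idiom(words: List[str], start: int) -> Tuple[str, int]:
    """Return (idiom id, length) for an idiom starting at `start`, or ("", 0)."""
    for idiom_id, position, _ in IDIOM_INFO.get(words[start], []):
        if position != 1:
            continue                       # an idiom must begin with its word #1
        expected = 2
        for i in range(start + 1, len(words)):
            entry = [e for e in IDIOM_INFO.get(words[i], [])
                     if e[0] == idiom_id and e[1] == expected]
            if not entry:
                break                      # sequence broken, not this idiom
            if entry[0][2]:                # last word of the idiom reached
                return idiom_id, i - start + 1
            expected += 1
    return "", 0


def collect_idioms(words: List[str]) -> List[str]:
    """Replace each complete idiom with a single concept token."""
    out, i = [], 0
    while i < len(words):
        idiom_id, length = match_idiom(words, i)
        if length:
            out.append(f"idiom#{idiom_id}")
            i += length
        else:
            out.append(words[i])
            i += 1
    return out
```

In this sketch "we will have a ball tonight" collapses the three idiom words into one concept token, while "have a bat" is left unchanged because the sequence never reaches the last word of the idiom.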

6) Fuzzy spell corrector: When all other dictionary retrieval mechanisms have failed, the method of the present invention invokes a spell corrector. The spell corrector dissects the word into fragments (morphemes). Each fragment is used as an entry point into a network of word fragments.

Links in the network are traversed to find other fragments which sound alike, or are spelled alike. The resulting set of fragments is then used to find candidate words in the dictionary, which has been pre-indexed based on fragment.

This spell check mechanism is "fuzzy" because it is not based on a rigid set of rules like typical soundex mechanisms. Rather, it uses accumulated evidence (in much the same way as the text retrieval engine) to find strong candidate words, which are then ranked by their suitability. This "fuzziness" provides a single integrated mechanism for correcting words with spelling errors, phonetic errors, and Optical Character Recognition (OCR) errors.
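
A much simplified Python sketch of fragment-based candidate ranking follows. It substitutes character bigrams for the morpheme fragments described above and omits the sound-alike links entirely, so it illustrates only the pre-indexing by fragment and the ranking by accumulated evidence. The function names and the Dice overlap score are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def fragments(word: str) -> Set[str]:
    """Character bigrams of the padded word; a stand-in for morpheme fragments."""
    padded = f"^{word.lower()}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def build_index(dictionary: List[str]) -> Dict[str, Set[str]]:
    """Pre-index the dictionary: fragment -> words containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for word in dictionary:
        for frag in fragments(word):
            index[frag].add(word)
    return index


def suggest(word: str, index: Dict[str, Set[str]], top: int = 3) -> List[Tuple[str, float]]:
    """Rank dictionary words by accumulated fragment evidence (Dice overlap)."""
    query = fragments(word)
    shared: Dict[str, int] = defaultdict(int)
    for frag in query:
        for candidate in index.get(frag, ()):
            shared[candidate] += 1        # one more fragment in common
    scored = [(cand, 2 * n / (len(query) + len(fragments(cand))))
              for cand, n in shared.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top]


index = build_index(["retrieval", "retrieve", "relative", "receive"])
candidates = suggest("retreival", index)
```

Because the score grows with every shared fragment instead of depending on a fixed rule, a transposition such as "retreival" still ranks "retrieval" first in this sketch.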

The third step is parsing. Once the input sentence has been prepared for parsing, a robust syntactic parser with integrated semantic interpretation is applied. The parser is a bottom up chart parser with unification rules. Full sentences will always be attempted in parsing. However, when sentences are ungrammatical or unwieldy, or when the input text string is not a full sentence, the chart parser will produce phrase or fragment parses. Hence, the output of the parser may be a complete sentence parse, a collection of phrase parses with missing attachments, or even an isolated word group. In any case, the parser never fails (100% recovery).

The fourth step in the processing is to disambiguate word senses not handled by the parser. This is a semantic word sense disambiguation and takes place via a spreading activation concept through a semantic network. Figure 3 illustrates the concept, which is to disambiguate word senses based on "concept collocation". If a new word sense appears in the text, the likelihood is that it is similar in meaning to recent words in the text. Hence, recent syntactically compatible terms are compared through the semantic network by spreading activation or semantic "distance".

An underlying semantic network defined over the word senses is used in this step. Note that only an "association link" type is used (which will be assigned a link strength from 0 to 1, or a fuzzy link strength in a fuzzy logic implementation of the network).

As another example, consider the sentence "Tools are required to identify software bugs." The correct meaning of the word "tool" may be found by spreading activation. The nodes in the network correspond to word senses or idioms. The arcs contain assigned or computed weights. The significant words in the input string or sentence are: tools, require, identify, software, bugs. The word tools has two word senses:
tool-1, such as hammer or saw, and
tool-2, as in software.

Consider the term tool-1. The spreading activation algorithm will find its relationship and weight to other terms by searching the network from this point. The following (linked) list will be produced. The weights are [...]

tool-1      1.0
saw         0.7
hammer      0.7
hardware    0.5
computer    0.4
software    0.35
software    0.32
code        0.24

[...] significant words [...] sentence has weight 0.35. By observing Figure 5, note that the relationship between tool-2 and [...]. tool-2 relates to bugs by weight 0.4. The words identify and require (not shown) are rather distant from [...] senses of tool. Hence tool-2 will be [...]
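
The fan-out from a word sense through the network may be sketched in Python as follows. The rule used here, that the weight reaching a term is the strongest product of link strengths along any path, is an assumption; the text does not state how weights are computed. The link strengths are also invented, apart from the 0.4 link between tool-2 and bugs mentioned above, and the names `spread` and `pick_sense` are hypothetical.

```python
import heapq
from typing import Dict, List

# Hypothetical association links (treated as two-way) with strengths in (0, 1].
LINKS: Dict[str, Dict[str, float]] = {
    "tool-1": {"saw": 0.7, "hammer": 0.7, "hardware": 0.5},
    "hardware": {"computer": 0.8},
    "computer": {"software": 0.5},
    "tool-2": {"software": 0.8, "bugs": 0.4},
}


def neighbors(node: str) -> Dict[str, float]:
    """Links touching `node`, treating every stored link as two-way."""
    out = dict(LINKS.get(node, {}))
    for src, targets in LINKS.items():
        if node in targets:
            out[src] = targets[node]
    return out


def spread(start: str, floor: float = 0.05) -> Dict[str, float]:
    """Activation reaching each node: the strongest product of link strengths."""
    best = {start: 1.0}
    queue = [(-1.0, start)]
    while queue:
        neg, node = heapq.heappop(queue)
        if -neg < best.get(node, 0.0):
            continue                      # stale queue entry
        for nxt, strength in neighbors(node).items():
            weight = -neg * strength
            if weight >= floor and weight > best.get(nxt, 0.0):
                best[nxt] = weight
                heapq.heappush(queue, (-weight, nxt))
    return best


def pick_sense(senses: List[str], context: List[str]) -> str:
    """Choose the sense whose activation best covers the context words."""
    def support(sense: str) -> float:
        reached = spread(sense)
        return sum(reached.get(word, 0.0) for word in context)
    return max(senses, key=support)
```

With this toy network, the context words "software" and "bugs" favor tool-2, whereas "hammer" and "saw" favor tool-1. Context words that are absent from the network, or that fall below the activation floor, contribute nothing to either sense.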



Hill-Climbing Search Methods

[...] is produced. Rather, a list of document sets is produced. This document set list [...]

Automatic Acquisition of Semantic Networks 

seeFtaureT^^ 

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database entries. 



... of words ... may be built by reconciling word meanings against
multiple dictionaries and thesaurus data such as Princeton's WordNet (George Miller, of
Princeton University, has produced a 60,000 term semantic network of basic English ...
method is the ability to add or merge nearly any published dictionary, whether in specialized ...

Automatic acquisition methods can be used to build the lexical database ...

... identify missing words, identify esoteric words of specialized domains, etc.
Most of these operations will be transparent to the user and will be accomplished as ...

lu^Z^rT 8 ' ar9e databSSe ^ M ° ^^^5^^!^ aS^ 
The Composite Dictionary 

tr n Jtll!L USt ! a , te !. th8 eX . PeCt8d resu,ting dictionary and how it will be acquired. The shaded boxes illus- 

ir^a^r^ 

the ^z^^^t^z^ .jr 100,8 to ,mport ^ lnto 

'anguage^to*^^^^ 
Tools to load words: 

• Load root words - ... when root words (singular nouns, infinitive verbs, un-intensified
adjectives and words of any other part of speech) are available. Dictionaries are the best source for ...

• Load inflected words - Inflected words are reduced to root words ...

• Load descriptive phrases - Sometimes a descriptive phrase is ...
"notify formally" is a descriptive phrase which ...



Tools to load links: 



• Link neighboring terms - which are variations of words which occur in the dictionary. For example, the
words "happy", "happily", and "happiness" are all neighboring terms.

• Link alternate spellings - ... to the main ... For example, "color" and "colour" would be linked
together.

" aTs^f^h^^ 
tTS^ 

• sssr ^^^^ 

• Link descriptive phrases to their components - Words in the phrase are linked to the phrase.

For example, "notify" and "formally" are both linked to the descriptive phrase "notify formally".

Tools to Cleanse Dictionary: 

• Remove duplicate meanings - Duplicate or closely related meanings of a word are merged. The "closeness"

... is determined by looking into the semantic network and computing a dis-
tance factor based on the number and the weight of links required to get from one meaning to another.

™ £^ !Li spedf ied - The appropriate >» 

putmg a d,stance factor (see above) between the source meaning and all possible meanings in the des- 
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tfnation. The most closely related meaning is assumed to be the source. 
Tools to Assign Link Weights: 

Text Scanning Tools 

Is large enough, the histogram should give a aood lnd^^!L-~ZL prB - indexed >- » »e text database 

rather than words. If a word-sense occurs, then ine lWs ^word-senses 

I = - ln(M/n) 

convno^^tnl»Iiy 0r 7 ,at, 5. ,n that « w * Mn »* Word senses «ke "the" and "information" are 

eral than "house" Hence "r r^u ».^kTCL^ h8S links and is therefore more gen- 

factors wiH^np^^ to "» — of .inks it has. TheJL 

in tests. uy caromed using an adjustable empncal combination weighting that will be varied 

^Prior art used this mechanism, a!so caned Inverse Document Frequency . ow wc*s, r», word mean- 
Query Augmentation to Improve Recall - Using Syntactic and Semantic Information 

riation is usedVa^ticai s^oveTw^S^ V*?* " me Each 
^nessamongotherfec.*^^ 

database^by producing alternative ways of J^t^^Z? ^ ^ 0ther8earc "^ and 

additions proc.sjfor the augTnSylZ^ 
^^^^ 

effective query contains detailed inform«»in„ h.?f »~ T ■ . phrase "sentence. The most 

1) "requirements for the use of a CASE tool"
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2) "What are the requirements for using CASE tools on this project?" 

3) "maintenance required for the knowledge base" 

4) "linked lists" 

aues«on^ jSiSrEr 8 n "* neC8SSari .' y r Wered> ^ tHat ** re,evant to the " «» 

question is retrieved. Natural language queries involving complex planning or temporal relationships such as
Find me everything changed

until sophisticated "common sense" inference engines become practical and available. It is possible that later
ZlT^ * Bn ^ a " by 8 deepw "Ending inference engine that "JS 
both topical queries and structured field delimiters. «w requests inro 

In addition, while less interesting, queries to allow the user to fill in a document title, author, date, or any

The detailed steps for performing this retrieval process are:
Step 1: Identify Words, Syntactic Parsing, and Semantic Interpretation

These are performed using the same algorithms as when doing document indexing (see discussion above).

Step 2: Phrase Slots 

... Each phrase ... into different "slots" by syntactic and semantic parsing, to help deter-
mine their function. In the noun phrase "the red ball", the word "the" is put in the "determiner" slot ...

...

A weight (or a "confidence factor") is attached to each word in the phrase based on the slot to which it was assigned.

Step 3: Look for Closely Associated Concepts

Each word in the user's phrase request has one or more meanings, which were determined in semantic
interpretation as best as possible. However, if the user's exact request is not in any of the documents, it is
natural to extract information that is closely related. For example, if the user asks for "radar sets", it seems
reasonable to also retrieve information on "jammers".

The concepts associated with the words in the user's request are used as a starting point. Then closely
related concepts are identified by traversing links in the semantic network (semantic links identify how concepts are related).
The less closely related concepts are given a lower weighting, since they do not ...
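
A minimal sketch of this expansion step follows. The link table `SEMANTIC_LINKS`, its strengths, and the `max_hops` and `min_weight` cut-offs are hypothetical values chosen for illustration; the source does not specify how far the traversal goes or how the weights decay. Here a related concept's weight is the requesting concept's weight multiplied by the link strength, so more distant concepts receive lower weights.

```python
# Hypothetical semantic links: concept -> {related concept: link strength}.
SEMANTIC_LINKS = {
    "radar": {"jammer": 0.6, "antenna": 0.5},
    "jammer": {"countermeasure": 0.7},
}

def expand_query(concepts, max_hops=2, min_weight=0.3):
    """Return {concept: weight}, original concepts at 1.0, related ones lower."""
    weights = {c: 1.0 for c in concepts}
    frontier = dict(weights)
    for _ in range(max_hops):
        reached = {}
        for concept, w in frontier.items():
            for related, strength in SEMANTIC_LINKS.get(concept, {}).items():
                rw = w * strength  # weight decays with every link traversed
                if rw >= min_weight and rw > weights.get(related, 0.0):
                    weights[related] = rw
                    reached[related] = rw
        frontier = reached
    return weights

expanded = expand_query(["radar"])
```

Concepts whose decayed weight falls below `min_weight` are dropped, which keeps the augmented query from growing without bound.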

Step 4: Weighting By Word Specificity

If the user asks for "find things", the word "things" causes difficulty because it refers to a very wide range
of possible objects, and therefore it does not help reduce the number of documents very much. Because of

this, ... weights that indicate how specific or general they are. These are included
into the weighting factors determined so far. Very general concepts, such as "things", "objects" and "stuff",

have a low weight. Specific concepts such as "knee cap" have a much higher weight.
Only a few of these weights need be included in the document. Other word sense weights can be deter-

mined by searching in the hierarchies (class hierarchies ... semantic ...).
Concepts lower in the class hierarchy are assumed to be more specific.
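
One way to derive a specificity weight from a class hierarchy is sketched below. The hierarchy `PARENT`, the linear depth-to-weight mapping, and the `explicit` override table are assumptions for illustration only; the source says only that lower concepts are more specific and that a few weights may be stated explicitly.

```python
# Hypothetical class hierarchy: concept -> parent concept (None at the root).
PARENT = {
    "thing": None,
    "object": "thing",
    "body part": "object",
    "bone": "body part",
    "knee cap": "bone",
}

def depth(concept):
    """Number of links between the concept and the root of the hierarchy."""
    d = 0
    while PARENT.get(concept) is not None:
        concept = PARENT[concept]
        d += 1
    return d

def specificity_weight(concept, explicit=None):
    """Explicitly stored weights win; otherwise deeper concepts weigh more."""
    if explicit and concept in explicit:
        return explicit[concept]
    max_depth = max(depth(c) for c in PARENT)
    return (depth(concept) + 1) / (max_depth + 1)
```

For example, `specificity_weight("knee cap")` is larger than `specificity_weight("thing")`, and `specificity_weight("thing", {"thing": 0.05})` returns the stored override.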

Step 5: Index into the Concept Indexes 

The word senses in the user's request (along with closely associated concepts) are used as keys into the
database of concepts which were built from documents which were stored and indexed. Each concept in the
index has a list of references associated with it. Each concept reference points to a particular document, sen-
tence, and phrase.
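
The index described in this step can be pictured as an inverted index keyed by concept. The sketch below is an illustration under assumed names (`Ref`, `ConceptIndex`, `add`, `lookup`), not the data structure of the actual system.

```python
from collections import defaultdict
from typing import NamedTuple

class Ref(NamedTuple):
    """A concept reference: which document, sentence, and phrase it points to."""
    doc: int
    sentence: int
    phrase: int

class ConceptIndex:
    def __init__(self):
        self._refs = defaultdict(list)  # concept -> list of Ref

    def add(self, concept, doc, sentence, phrase):
        self._refs[concept].append(Ref(doc, sentence, phrase))

    def lookup(self, weighted_concepts):
        """Map each referenced location to the (concept, weight) pairs found there."""
        hits = defaultdict(list)
        for concept, weight in weighted_concepts.items():
            for ref in self._refs.get(concept, []):
                hits[ref].append((concept, weight))
        return dict(hits)

index = ConceptIndex()
index.add("tool-2", doc=1, sentence=0, phrase=0)
index.add("bug", doc=1, sentence=0, phrase=0)
index.add("tool-1", doc=2, sentence=3, phrase=1)
hits = index.lookup({"tool-2": 1.0, "bug": 0.8})
```

Because the keys are word senses rather than words, a query for tool-2 does not retrieve the document that only mentions tool-1.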
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ste P 6: Weighting by Qu antity (AKA Inverse Document Frequent 

In information theory, the concepts which occur most often are the least useful (contain the least infor-
mation). This is true when considering text retrieval as well. ... has a database of information

... a quantity of occurrences. The weighting factor of each concept

is adjusted ... quantity of occurrences (a large number of occurrences will cause the weight to be
reduced because a frequently occurring concept carries less information since it will be in many documents).
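
As an illustration of weighting by quantity, the sketch below uses a smoothed, normalized inverse document frequency. This particular formula is a common textbook choice and an assumption here; the source does not state which formula is used. The function names are hypothetical.

```python
import math

def quantity_factor(doc_count, total_docs):
    """Factor in (0, 1]: 1.0 for a concept in one document, smaller as it spreads."""
    if doc_count < 1 or doc_count > total_docs:
        raise ValueError("doc_count must be between 1 and total_docs")
    return math.log(1 + total_docs / doc_count) / math.log(1 + total_docs)

def adjust_weights(weights, doc_counts, total_docs):
    """Scale each concept weight by its quantity factor."""
    return {
        concept: w * quantity_factor(doc_counts[concept], total_docs)
        for concept, w in weights.items()
    }

adjusted = adjust_weights(
    {"the": 1.0, "jammer": 1.0}, {"the": 1000, "jammer": 10}, total_docs=1000
)
```

A concept found in every document keeps only a small fraction of its weight, while a rare concept keeps nearly all of it.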

Step 7: Ranking by Proximity 

References are further ranked by how close the references occur to each other. For example, if the query
... the references ... and all the references for ...

point to the same phrase, then that phrase is a better match than other phrases.

If multiple references occur in the same phrase, then the document is given a much higher weight. If the
references appear further apart, the weight will be proportionately lower.
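
A minimal sketch of proximity weighting is shown below. It assumes each reference is reduced to a running phrase number within the document and that the weight falls off as 1 / (1 + span); both choices are hypothetical, since the source only says that the weight is proportionately lower when references are further apart.

```python
def proximity_weight(phrase_positions):
    """Weight in (0, 1] for a set of matched references in one document.

    phrase_positions: phrase numbers (counted from the start of the
    document) at which the query concepts were found.
    """
    if not phrase_positions:
        return 0.0
    span = max(phrase_positions) - min(phrase_positions)
    return 1.0 / (1 + span)
```

All references in the same phrase give a span of zero and the full weight of 1.0; references two phrases apart give one third of that.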

Step 8: Adjust Phrase Weights with a Fine-grain match

The user's noun phrase query and the stored document noun phrase can be compared in detail with a
graph matching procedure. This algorithm is described in more detail below. At this time the top ... (or other

limit) phrases will be ranked, based on ...

information all the way from the start of this algorithm and were adjusted as ...

... within the method of the present invention as proposed for
... includes a user interaction and a user verification component. The user interaction com-

ponent allows the user to respond ...

word meanings. The user verification component is more complex and allows the user to ...

... the time required for document retrieval ...
broader query, or further restrict the current query with an additional natural language delimiter. The verifi-
cation step may not be required in most systems, depending on machine speed and number of ...



To process the user's query, the system augments the user's query. This augmentation begins with the
parsed query, including head words of key phrases and descriptors associated with those words. A weight is
assigned to each slot in the query phrase templates based on a generic template ...

... augmentation then ...

via spreading activation from the semantic word sense network.

The augmented query is then used to reference the concept index and the document reference files. A

... the semantic sense network, the syntactic
position relative to the query, the modifiers used in association with the head word, and a number of heuristic
check questions. The weighting factor adjustments will be determined empirically during installation.

Natural Language-Based Routing 

The method of the present invention has a "query by example" feature (also known in the art as relevance
feedback) ... a document similar to the one being viewed. The natural language profiles ...
had to be written by compound Boolean expressions, profiles may now be documents or por-
tions thereof, or user written as English descriptions of a few ... During rout-
ing operations the present invention indexes the English profiles as documents. Inbound news or messages
are treated as queries ... those profiles "retrieved" indicate to whom the items will be sent.
The query by example feature of the present invention may be classified as a "context vector" approach,
... the selection of terms together in ... Unlike other context vector approaches,

the present invention's context vector is a collection of word meanings used together in context,

not just a collection of terms. In addition, the method of the present invention includes the physical ordering
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of the terms in the original profile into the context vector. The context vector, or representation of the query 
document, is matched against the context vector of stored documents via the index. Documents with high
enough similar content are chosen to be retrieved.

The present invention presents two significant advances: First, the "context vector" is at the word meaning
level, and second, for routing applications, the end user writes a plain English abstract which is indexed as a
document; then inbound news wires are treated as queries, reducing the amount of repetitive processing
required.

Additional speed can be gained by pre-expanding the words in the English profiles. These expanded terms
are indexed along with the original terms in the profile. When an incoming document must be routed, its terms
need not be expanded (as they would be in the original query mode), and so the routing process is now much faster.
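
The routing arrangement can be sketched as follows: profiles are indexed once, with their terms pre-expanded, and each inbound item is then used as a query against that index. This is an illustration only. The `RELATED` table stands in for the semantic network, plain lowercase words stand in for word meanings, and the `min_hits` threshold is a hypothetical matching rule.

```python
from collections import defaultdict

# Hypothetical expansion table standing in for the semantic network.
RELATED = {
    "merger": {"acquisition", "takeover"},
    "oil": {"petroleum", "crude"},
}

class Router:
    def __init__(self):
        self._index = defaultdict(set)  # term -> users whose profile has it

    def add_profile(self, user, text):
        """Index a plain-English profile, expanding its terms once, up front."""
        terms = set(text.lower().split())
        expanded = set(terms)
        for term in terms:
            expanded |= RELATED.get(term, set())
        for term in expanded:
            self._index[term].add(user)

    def route(self, item_text, min_hits=2):
        """Treat the inbound item as a query; no expansion is needed here."""
        hits = defaultdict(int)
        for term in set(item_text.lower().split()):
            for user in self._index.get(term, ()):
                hits[user] += 1
        return sorted(user for user, n in hits.items() if n >= min_hits)

router = Router()
router.add_profile("alice", "oil merger news")
router.add_profile("bob", "football scores")
recipients = router.route("Takeover bid lifts crude prices")
```

The inbound item never mentions "oil" or "merger", yet it reaches the first profile because the expanded terms "crude" and "takeover" were indexed with it.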

Integrated ranking of documents based on 6 composite factors 

Most modern text retrieval systems attempt to control precision in part by ranking the order of the docu-
ments that are retrieved. There are as many ranking schemes as there are systems that rank documents. Most
of these systems rank the documents on the frequency of occurrence of the terms in the query. Some systems
also take into account the inverse document frequency of terms. Yet other systems rank on position, giving
higher weight to terms that appear in the title or leading paragraph of a document.

The method of the present invention has the ability to rank on a multitude of factors simultaneously, in-
cluding those factors mentioned above combined with several new methods based upon linguistics. The main
novel feature is the ability to "tune" the ranking based on all these factors and to easily add other factors
whenever needed. Different document collections can be ranked on criteria that are optimized to them.

This approach may be summarized as follows: A concept level "inference network" is used to match the
concept sought after to concepts represented in the text. This inference network computes a matching score
based upon evidence from 6 different sources. More or fewer sources could be used in the same mechanism.
The importance of the factors used in the inference network is determined by the statistics of the database
being searched. The factors count information based upon word meanings, not just words, and linguistic
information, not just statistical, is taken into account.

Ranking of documents means sequencing the documents returned by a query so that documents which
best match the user's query are displayed first. The ranking algorithm is the heart of the decision making proc-
ess and is therefore one of the most important algorithms in any information retrieval system. The present
invention has a very sophisticated ranking algorithm that uses six criteria for ranking documents based on
the terms in the expanded query. These are defined below. Following is a definition of the algorithm to be
used for combining these factors.

1. Semantic Distance. Documents which contain exact matches on the original words in the query are
ranked higher than documents which contain only related terms.

2. Proximity. If the matching terms in a document occur close together, then the document is ranked higher
than if the matching terms are spread widely over the document.

3. Completeness. Documents are ranked higher if they "completely" represent the query, that is, the docu-
ment should contain all of the terms from the query, or at least one related term for each term in the query.

4. Quantity. Documents are ranked higher if they contain many hits on the terms in the expanded query.

5. Order and Syntax. If the order of the terms in the document is the same as the order of the terms in
the query, the document is ranked slightly higher than others in the same class. When the syntax modules
of the present invention are completely integrated, more advanced mechanisms for matching the syntax
of the query against the syntax of the matching terms in the document can be employed.

6. Term Specificity and Information Content. Certain terms, such as "stuff", "things", and "information",
are especially vague and are therefore not reliable index terms. Documents and queries which contain
these terms are ranked lower. Other terms are weighted by information theoretic measures.
The ranking algorithm proceeds as follows. Each query is dissected into "atomic" query fragments. These
"fragments" are words for Boolean or statistical queries, but are phrases or even short sentences for natural
language queries. For each "fragment", the "evidence" of each occurrence of that fragment in the document
is assessed. Then, the total evidence for the fragment is calculated. Finally, the evidence for the presentation
of a document is calculated by combining the fragment evidence. Thus we have a 4-step process.
Step 1. Find the query fragments, Q^i.

Step 2. Evaluate P^i_j, the evidence that Q^i appears in the "jth" position in the document.
Step 3. Compute E^i, the combined evidence for Q^i, calculated from P^i_j.
Step 4. Combine E^i for all Q^i into one single evidence value E for the document.
Each of these steps will be explained in further detail below.
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S 5*T .I? ' 1 ^ " """P 0 "™ 1 (multiple sentences, or use of widely scoped conjunctions 

... Q^i is found by heuristics. Query by example will contain many ...
Step 2: for Q^i, it is first necessary to find the "jth" position. For word queries it is simply the ...

... is taken, where k = n times the length of Q^i. These windows must be overlapped by 50% to be accurate, hence ...

To calculate P^i_j requires several factors. P^i_j will be allowed to vary from 0 to U, where U is an upper limit
determined by document position (typically near the title or leading paragraph ...) ...

... word specificity, inverse document frequency and syntactic

position ... factors (semantic distance, syntactic order). An "untested" ... one has been pro-

posed, but will not be presented here due to its tentative status.

To combine P^i_j to get E^i, use the following procedure:

1) If P^i_j < T, reset P^i_j = 0

(to remove "noise")

2) 1 - E^i = \prod_j (1 - P^i_j)

3) Adjust for document length by an empirical formula to be determined.

... E is formed by combining values of E^i as follows. First reduce E^i by ... where ... varies
from 0 to ... for broad (OR) searches to narrow (AND) searches, respectively. Then use E^i to ...

E = E*(1 - E^i) + E^i

etc., until all k values are used.
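
The combination procedure can be sketched in a few lines. The sketch covers only the parts that are legible in the text: the noise threshold T, the product rule for a fragment's combined evidence, and the running update E = E*(1 - E^i) + E^i. The document-length adjustment and the OR/AND reduction factor are omitted because the source leaves them unspecified, and the position scores are assumed to lie between 0 and 1.

```python
def fragment_evidence(position_scores, threshold):
    """Combine the per-position scores of one fragment.

    Scores below the threshold are zeroed as noise, then
    1 - E_i = product over j of (1 - P_ij).
    """
    remaining = 1.0
    for p in position_scores:
        if p < threshold:
            p = 0.0  # remove "noise"
        remaining *= 1.0 - p
    return 1.0 - remaining

def document_evidence(fragment_values):
    """Fold fragment evidence values into one score: E = E*(1 - E_i) + E_i."""
    e = 0.0
    for e_i in fragment_values:
        e = e * (1.0 - e_i) + e_i
    return e

e_first = fragment_evidence([0.5, 0.5, 0.05], threshold=0.1)
e_doc = document_evidence([e_first, 0.2])
```

Both rules behave like a probabilistic OR: each additional piece of evidence raises the total, but the total never exceeds 1.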

Private concept search by conceptual graphs 

The method of the present invention overcomes the limitations of prior art systems in three distinct ways: First,
the user describes relationships between concepts by relationship type, not arbitrary numerical ...
... relationship is to word meanings or concepts ...

concepts need to be defined. Most concepts already exist in the dictionary ... specialized

... graphic, using special-purpose software or constructing a file according to a
predefined specification. This conceptual graph then gets attached to the underlying semantic network.

...

testing. Otherwise the method works as usual, by plain English concept based processing.

References to External Objects 

Prior art systems stored references in the indexes to the text databases which were used to build the in-

dexes. The method of the present invention can store arbitrary reference data in the indexes. This allows the pres-
ent invention to store references to any kind of object, such as words in an image (a bit-map representation of a
document page), features in a picture, points in an audio presentation, points in a video presentation, etc.

Intelligent Hypertext



The high accuracy of search and retrieval of the method of the present invention enables documents to
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be "browsed" without the need for pre-established hypertext links. During the browse mode operation of the 
STpretK 



Machine Abstracting 



The method of the present invention, with additions, can be used to automatically create summaries

of documents (called "machine abstracting"). This is done with the following ...

... the document. This includes tokenization, dictionary lookup, morphology, syntax ...

3) Determine the most frequent concepts in the document, using histograms or some other technique.

4) Construct the abstract by excerpting sentences from the original document. Sentences containing the ...

general domain concepts, not explicitly mentioned, will be recognized. For example, a ...
or terms will recognize this concept as relevant to the document as a whole.
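
A rough sketch of the histogram and excerpting steps is given below. It is a simplification under stated assumptions: plain lowercase words stand in for disambiguated concepts, the `STOPWORDS` set and the scoring rule (count of top concepts per sentence) are hypothetical, and the recognition of unmentioned general domain concepts is not attempted.

```python
import re
from collections import Counter

# Hypothetical stop list; a real system would weight by word sense instead.
STOPWORDS = {"the", "a", "an", "of", "in", "was", "is", "and", "to"}

def tokens(sentence):
    """Lowercase content words of a sentence."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]

def abstract(text, n_sentences=2, n_concepts=4):
    """Excerpt the sentences that mention the most frequent concepts."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    histogram = Counter(w for s in sentences for w in tokens(s))
    top = {w for w, _ in histogram.most_common(n_concepts)}
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(w in top for w in tokens(sentences[i])),
    )
    keep = sorted(ranked[:n_sentences])  # restore original document order
    return [sentences[i] for i in keep]

summary = abstract(
    "Semantic networks link word senses. The weather was pleasant. "
    "Word senses in semantic networks improve retrieval."
)
```

The off-topic middle sentence contains none of the frequent concepts, so it is left out of the abstract.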
Statement of Industrial Utility 

The invention is useful in search and retrieval tasks related to computer databases. It is par-

ticularly suited for enhanced precision and recall in textual databases.



Claims 
1. 



A method for searching a computer database containing one or more documents comprised of symbols
representative of information expressed in a language which is understandable to human users ...
(a) accepting a query comprising one or more symbols representative of information comprising one ...
(b) determining one or more likely meanings for each term in said query;
(c) identifying in rank order ... one or more of said likely meanings in said database;

(d) storing an indication of the identities of said identified documents in the memory of a digital computer.

The method of claim 1 wherein said step (a) of accepting a query comprises accepting a document com-
posed of symbols representative of information expressed in a language which is understandable to hu-
man users, said document containing information which is intended by a user to be ... information
contained in said identified individual documents.

IlT^*^ 1 St6P (a) rfa ^P« n 9 a ^ uprises identificatfonof one or more 

subsets of symbols representative of information expressed in a language which is understandabTS 
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hum» «™. „*» eonwto „„.„„«», a. ^ saM 

users compris- 

qUery COmPriSin9 ° ne " m0re -«tat*e of information comprising one 

SSSSCX ^^ST ^ -tain at .east one aa* likely meanings 

which is understondabie to hu^„ ^Tc^^^X^ 2 rmat '° n ' n * 

(a) accepting a query comprising one or more symbols reDresentat™ m ir*™^ 

!S iH^7 ,lnin ^ °" e ? Hkely "^"'"S 8 for «* te™ in said query; 
d £»S S ra " k ^ 0 "^-"«e of »aid like* meanings in Ld database; 

Sf^ 

fl storing an indication of the identities of said identified individua. sections in the memory of a digital 
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computer. 

14. A method for constructing a combination associative network of term meanings and machine-readable
dictionary from a plurality of machine-readable linguistic databases comprising the following steps:

(a) identifying root words and their associated meanings; 

(b) identifying non-root words and their associated meanings, and identifying their relationships to the 
root words identified in step (a); 

(c) identifying descriptive phrases and idioms and their associated meanings, and identifying their re-
lationships to the root words identified in step (a);

(d) identifying link relationships among the root words, non-root words, descriptive phrases, idioms
and their associated meanings identified in steps (a-c);

(e) determining link strengths for each link relationship identified in step (d); and

(f) storing said root words, non-root words, descriptive phrases, idioms, meanings, links, and link
strengths in the memory of a digital computer.

15. A method for real-time characterization of source documents comprised of a plurality of symbols repre-
sentative of information expressed in a language which is understandable to human users comprising
the steps of:

(a) identifying terms within said source document;

(b) searching a database containing documents comprised of one or more terms and enhancement in-
formation associated therewith;

(c) identifying documents within said database containing one or more of said identified terms; and

(d) associating the enhancement information associated with said identified documents with said
source document.

16. A method for enhancing the content of a document comprised of symbols representative of information 
expressed in a language which is understandable to human users comprising the steps of: 

(a) identifying one or more terms comprised of one or more symbols within said document; 

(b) determining one or more likely meanings for each term in said document;

(c) identifying in rank order one or more of said likely meanings of said identified terms;

(d) optionally identifying additional likely meanings which are related to said likely meanings identified 
in step (c); and 

(e) storing said document, said identified likely meanings, and said identified additional likely mean- 
ings in the memory of a digital computer. 

17. The method of claim 16 wherein said database contains documents comprised of symbols representative
of information expressed in a language which is understandable to human users, and wherein said da-
tabase is constructed according to a method comprising the steps of:

(a) identifying one or more terms comprised of one or more symbols within each of said documents;

(b) determining one or more likely meanings for each term in each of said documents; 

(c) identifying in rank order one or more of said likely meanings of said identified terms; 

(d) optionally identifying additional likely meanings which are related to said likely meanings identified 
in step (c); and 

(e) storing each of said documents, said identified likely meanings, and said identified additional likely 
meanings in the memory of a digital computer as said database. 

18. A method for indexing a document comprised of symbols representative of information expressed in a 
language which is understandable to human users comprising the steps of: 

(a) identifying one or more terms comprised of one or more symbols within said document; 

(b) determining one or more likely meanings for each term in said document;

(c) identifying in rank order one or more of said likely meanings of said identified terms;

(d) determining the informational value of each of said likely meanings, and discarding those likely 
meanings having an informational value which is less than a predetermined value; and 

(e) storing said document and said identified likely meanings in the memory of a digital computer.
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